Abstract

In this paper, we present an algorithm for unconstrained face
verification based on deep convolutional features and evaluate it on
the newly released IARPA Janus Benchmark A (IJB-A) dataset as well as
on the traditional Labeled Faces in the Wild (LFW) dataset. The IJB-A
dataset includes real-world unconstrained faces from 500 subjects with
full pose and illumination variations, which make it much harder than
the LFW and YouTube Faces (YTF) datasets. The deep convolutional
neural network (DCNN) is trained using the CASIA-WebFace dataset.
Results of experimental evaluations on the IJB-A and the LFW datasets
are provided.

1. Introduction 

Face verification is one of the core problems in computer vision and
has been actively researched for over two decades [?]. In face
verification, given two videos or images, the objective is to
determine whether they belong to the same person. Many algorithms have
been shown to work well on images that are collected in controlled
settings. However, the performance of these algorithms often degrades
significantly on images that have large variations in pose,
illumination, expression, aging, cosmetics, and occlusion.

To deal with this problem, many methods have focused on learning
invariant and discriminative representations from face images and
videos. One approach is to extract an over-complete and
high-dimensional feature representation, followed by a learned metric
that projects the feature vector into a low-dimensional space and
computes the similarity score. For instance, the high-dimensional
multi-scale Local Binary Pattern (LBP) [5] features extracted from
local patches around facial landmarks are reasonably effective for
face recognition. Face representation based on the Fisher vector (FV)
has also been shown to be effective for face recognition problems
[26], [23]. However, deep convolutional neural networks (DCNN) have
demonstrated impressive performance on different tasks such as object
recognition [?], [?], object detection [?], and face verification
[25]. It has been shown that a DCNN model can not only characterize
large data variations but also learn a compact and discriminative
feature representation when the size of the training data is
sufficiently large. Once the model is learned, it is possible to
generalize it to other tasks by fine-tuning the learned model on
target datasets [?]. In this work, we train a DCNN model using a
relatively small face dataset, the CASIA-WebFace [38], and compare the
performance of our method with other commercial off-the-shelf face
matchers on the challenging IJB-A dataset, which contains significant
variations in pose, illumination, expression, resolution, and
occlusion. We also evaluate the performance of the proposed method on
the LFW dataset.

The rest of the paper is organized as follows. We briefly review some
related works in Section 2. Details of the different components of the
proposed method, including the DCNN representation and joint Bayesian
metric learning, are given in Section 3. The protocol and the
experimental results are presented in Section 4. Finally, we conclude
the paper in Section 5 with a brief summary and discussion.

2. Related Work 

In this section, we briefly review several recent related 
works on face verification. 

2.1. Feature Learning 

Learning an invariant and discriminative feature representation is
the first step for a face verification system. Feature representations
can be broadly divided into two categories: (1) hand-crafted features,
and (2) feature representations learned from data. In the first
category, Ahonen et al. [?] showed that the Local Binary Pattern (LBP)
is effective for face recognition. Gabor wavelets [39], [37] have also
been widely used to encode multi-scale and multi-orientation
information for face images. Chen et al. [6] demonstrated good results
for face verification using the high-dimensional multi-scale LBP
features extracted from patches around facial landmarks. In the second
category, Patel et al. [24] and Chen et al. [?], [?] applied
dictionary-based approaches for image and video-based face recognition
by learning representative atoms from the data which are compact and
robust to pose and illumination variations. The methods in [26], [23],
and [7] used the FV encoding to generate an over-complete and
high-dimensional feature representation for still and video-based face
recognition. Lu et al. [22] proposed a dictionary learning framework
in which the sparse codes of local patches generated from local patch
dictionaries are pooled to generate a high-dimensional feature vector.
The high dimensionality of the feature vectors makes these methods
hard to train and to scale to large datasets. However, advances in
deep learning methods have shown that compact and discriminative
representations can be learned using a DCNN from very large datasets.
Taigman et al. [33] learned a DCNN model on the frontalized faces
generated with a general 3D shape model from a large-scale face
dataset and achieved better performance than many traditional face
verification methods. Sun et al. [?], [30] achieved results that
surpass human performance for face verification on the LFW dataset
using an ensemble of 25 simple DCNNs with fewer layers, trained on
weakly aligned face images from a much smaller dataset than the
former. Schroff et al. [25] adapted the state-of-the-art deep
architecture for object recognition to face recognition and trained it
on a large-scale unaligned private face dataset with the triplet loss.
This method also achieved top performance on face verification
problems. These works essentially demonstrate the effectiveness of the
DCNN model for feature learning and for
detection/recognition/verification problems.

Figure 1. An overview of the proposed DCNN approach for face verification.

2.2. Metric Learning 

Learning a similarity measure from data is the other key component
that can boost the performance of a face verification system. Many
approaches have been proposed in the literature that essentially
exploit the label information from face images or face pairs. For
instance, Weinberger et al. [36] proposed the Large Margin Nearest
Neighbor (LMNN) metric, which enforces the large margin constraint
among all triplets of labeled training data. Taigman et al. [32]
learned the Mahalanobis distance using the Information Theoretic
Metric Learning (ITML) method [12]. Chen et al. [5] proposed a joint
Bayesian approach for face verification which models the joint
distribution of a pair of face images instead of the difference
between them, and the ratio of between-class and within-class
probabilities is used as the similarity measure. Hu et al. [17]
learned a discriminative metric within the deep neural network
framework. Huang et al. [18] learned a projection metric over a set of
labeled images which preserves the underlying manifold structure.

3. Method 

Our approach consists of both training and testing stages. For
training, we first perform face and landmark detection on the
CASIA-WebFace and the IJB-A datasets to localize and align each face.
Next, we train our DCNN on the CASIA-WebFace dataset and derive the
joint Bayesian metric using the training sets of the IJB-A dataset and
the DCNN features. For testing, given a pair of test image sets, we
compute the similarity score based on their DCNN features and the
learned metric. Figure 1 gives an overview of our method. The details
of each component of our approach are presented in the following
subsections.

3.1. Preprocessing 

Before training the convolutional network, we perform landmark
detection using the method presented in [?] because it is effective on
unconstrained faces. Then, each face is aligned into the canonical
coordinate frame with a similarity transform using 7 landmark points
(i.e., two left eye corners, two right eye corners, the nose tip, and
two mouth corners). After alignment, the face image resolution is
100 x 100 pixels, and the distance between the centers of the two eyes
is about 36 pixels.
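
The alignment step can be illustrated with a minimal sketch of a least-squares similarity fit. This is not the authors' implementation: the canonical landmark coordinates in `CANONICAL_LANDMARKS` are hypothetical values chosen only so that the eye centers are 36 pixels apart in a 100 x 100 frame, and the function names are assumptions. Each 2D point is treated as a complex number, so that scale and rotation are a single complex factor.

```python
# Hypothetical canonical positions (x, y) in a 100 x 100 frame, in the order:
# two left eye corners, two right eye corners, nose tip, two mouth corners.
# The eye centers are (32, 40) and (68, 40), i.e. 36 pixels apart.
CANONICAL_LANDMARKS = [
    (24.0, 40.0), (40.0, 40.0),
    (60.0, 40.0), (76.0, 40.0),
    (50.0, 60.0),
    (36.0, 78.0), (64.0, 78.0),
]


def fit_similarity(src, dst):
    """Least-squares similarity transform mapping src points onto dst points.

    Returns complex numbers (a, b) such that w = a * z + b, where a encodes
    scale and rotation and b encodes translation.
    """
    zs = [complex(x, y) for x, y in src]
    ws = [complex(x, y) for x, y in dst]
    n = len(zs)
    z_mean = sum(zs) / n
    w_mean = sum(ws) / n
    num = sum((w - w_mean) * (z - z_mean).conjugate() for z, w in zip(zs, ws))
    den = sum(abs(z - z_mean) ** 2 for z in zs)
    a = num / den
    b = w_mean - a * z_mean
    return a, b


def apply_similarity(a, b, point):
    """Map a single (x, y) point with the transform w = a * z + b."""
    w = a * complex(*point) + b
    return (w.real, w.imag)
```

In use, `fit_similarity(detected_landmarks, CANONICAL_LANDMARKS)` would give the transform with which the face image is resampled; the resampling itself is omitted here.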

3.2. Deep Face Feature Representation 

A DCNN with small filters and a very deep architecture (i.e., 19
layers in [27] and 22 layers in [31]) has been shown to produce
state-of-the-art results on many datasets, including ImageNet 2014,
LFW, and the YouTube Faces dataset. Stacking small filters to
approximate large filters and to build very deep convolutional
networks not only reduces the number of parameters but also increases
the nonlinearity of the network. In addition, the resulting feature
representation is compact and discriminative.

Our approach is motivated by [38]. However, we only consider the
identity information per face without modeling the pair-wise cost. The
dimensionality of the input layer is 100 x 100 x 1 for gray-scale
images. The network includes 10 convolutional layers, 5 pooling
layers, and 1 fully connected layer. The detailed architecture is
shown in Table 1. Each convolutional layer is followed by a rectified
linear unit (ReLU) except the last one, Conv52. Because ReLU
suppresses all negative responses to zero, we also use the parametric
ReLU (PReLU) [16] as an alternative; it allows negative responses,
which in turn improves the network performance. Moreover, two local
normalization layers are added after Conv12 and Conv22, respectively,
to mitigate the effect of illumination variations. The kernel size of
all filters is 3 x 3. The first four pooling layers use the max
operator. To generate a compact and discriminative feature
representation, we use average pooling for the last layer, pool5. The
feature dimensionality of pool5 is thus equal to the number of
channels of Conv52, which is 320. The dropout ratio is set to 0.4 to
regularize Fc6 because of its large number of parameters (i.e.,
320 x 10548). To classify a large number of subjects in the training
data (i.e., 10548), this low-dimensional feature should contain strong
discriminative information from all the face images. Consequently, the
pool5 feature is used for face representation. The extracted features
are further L2-normalized to unit length before the metric learning
stage. If there are multiple frames available for the subject, we use
the average of the pool5 features as the overall feature
representation. Figure 2 illustrates some of the extracted feature
maps.
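
The pooling of per-frame features into a single template feature can be sketched as follows. This is only an illustration under stated assumptions: the text specifies averaging the pool5 features and L2 normalization, but the order used here (average first, then normalize) and all function names are assumptions, not the authors' code.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit Euclidean length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)


def template_feature(frame_features):
    """Average the per-frame feature vectors, then scale the mean to unit length."""
    n = len(frame_features)
    dim = len(frame_features[0])
    mean = [sum(f[k] for f in frame_features) / n for k in range(dim)]
    return l2_normalize(mean)


def cosine_similarity(u, v):
    """Cosine similarity, i.e. the inner product of the two unit-length vectors."""
    u, v = l2_normalize(u), l2_normalize(v)
    return sum(a * b for a, b in zip(u, v))
```

For the network described above, each element of `frame_features` would be a 320-dimensional pool5 vector.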

3.3. Joint Bayesian Metric Learning 

To utilize the positive and negative label information available from
the training dataset, we learn a joint Bayesian metric, which has
achieved good performance on face verification problems [5]. Instead
of modeling the difference vector between two faces, this approach
directly models the joint distribution of the feature vectors of the
$i$th and $j$th images, $\{x_i, x_j\}$, as a Gaussian. Let
$P(x_i, x_j \mid H_I) \sim N(0, \Sigma_I)$ when $x_i$ and $x_j$ belong
to the same class, and $P(x_i, x_j \mid H_E) \sim N(0, \Sigma_E)$ when
they are from different classes. In addition, each face vector can be
modeled as $x = \mu + \epsilon$, where $\mu$ stands for the identity
and $\epsilon$ for pose, illumination, and other variations. Both
$\mu$ and $\epsilon$ are assumed to be independent zero-mean Gaussian
distributions, $N(0, S_\mu)$ and $N(0, S_\epsilon)$, respectively.

The log likelihood ratio of intra- and inter-classes, $r(x_i, x_j)$,
can be computed as follows:

\[
r(x_i, x_j) = \log \frac{P(x_i, x_j \mid H_I)}{P(x_i, x_j \mid H_E)}
            = x_i^T M x_i + x_j^T M x_j - 2 x_i^T R x_j, \tag{1}
\]

where $M$ and $R$ are both negative semi-definite matrices.
Equation (1) can be rewritten as
$(x_i - x_j)^T M (x_i - x_j) - 2 x_i^T B x_j$, where $B = R - M$. More
details can be found in [5]. Instead of using the EM algorithm to
estimate $S_\mu$ and $S_\epsilon$, we optimize the distance in a
large-margin framework as follows:

\[
\operatorname*{argmin}_{M, B, b} \sum_{i,j}
  \max\left[ 1 - y_{ij} \left( b - (x_i - x_j)^T M (x_i - x_j)
  + 2 x_i^T B x_j \right), 0 \right], \tag{2}
\]

where $b \in \mathbb{R}$ is the threshold, and $y_{ij}$ is the label
of a pair: $y_{ij} = 1$ if persons $i$ and $j$ are the same, and
$y_{ij} = -1$ otherwise. For simplicity, we denote
$(x_i - x_j)^T M (x_i - x_j) - 2 x_i^T B x_j$ as $d_{M,B}(x_i, x_j)$.
$M$ and $B$ are updated using stochastic gradient descent as follows
and are equally trained on positive and negative pairs in turn:

\[
\begin{aligned}
M_{t+1} &= \begin{cases}
  M_t, & \text{if } y_{ij}\,(b_t - d_{M,B}(x_i, x_j)) > 1 \\
  M_t - \gamma\, y_{ij}\, \Gamma_{ij}, & \text{otherwise,}
\end{cases} \\
B_{t+1} &= \begin{cases}
  B_t, & \text{if } y_{ij}\,(b_t - d_{M,B}(x_i, x_j)) > 1 \\
  B_t + 2 \gamma\, y_{ij}\, x_i x_j^T, & \text{otherwise,}
\end{cases} \\
b_{t+1} &= \begin{cases}
  b_t, & \text{if } y_{ij}\,(b_t - d_{M,B}(x_i, x_j)) > 1 \\
  b_t + \gamma_b\, y_{ij}, & \text{otherwise,}
\end{cases}
\end{aligned} \tag{3}
\]

where $\Gamma_{ij} = (x_i - x_j)(x_i - x_j)^T$, $\gamma$ is the
learning rate for $M$ and $B$, and $\gamma_b$ is the learning rate for
the bias $b$. We use random semi-definite matrices to initialize both
$M = V V^T$ and $B = W W^T$, where both $V$ and
$W \in \mathbb{R}^{d \times d}$, and $v_{ij}$ and
$w_{ij} \sim N(0, 1)$. Note that $M$ and $B$ are updated only when the
constraints are violated. In our implementation, the ratio of the
positive and negative pairs that we generate based on the identity
information of the training set is 1:20. In addition, the other reason
to train the metric instead of using traditional EM is that for the
IJB-A training and test data, some templates contain only a single
image. More details about the IJB-A dataset are given in Section 4.
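
A single stochastic update of Equation (3) can be sketched in plain Python as follows. This is an illustrative sketch rather than the authors' implementation: matrices are nested lists, the function names and the default learning rates `lr` and `lr_b` are assumptions, and pair sampling and initialization are omitted.

```python
def pair_distance(M, B, xi, xj):
    """d_{M,B}(xi, xj) = (xi - xj)^T M (xi - xj) - 2 xi^T B xj."""
    d = len(xi)
    diff = [p - q for p, q in zip(xi, xj)]
    quad = sum(diff[r] * M[r][c] * diff[c] for r in range(d) for c in range(d))
    cross = sum(xi[r] * B[r][c] * xj[c] for r in range(d) for c in range(d))
    return quad - 2.0 * cross


def hinge_loss(M, B, b, xi, xj, y):
    """Per-pair term of Equation (2); y is +1 for same identity, -1 otherwise."""
    return max(1.0 - y * (b - pair_distance(M, B, xi, xj)), 0.0)


def sgd_step(M, B, b, xi, xj, y, lr=0.01, lr_b=0.01):
    """Apply Equation (3) once. M and B are modified in place.

    Returns (new_b, updated), where updated is False when the margin
    constraint is already satisfied and nothing changes.
    """
    if y * (b - pair_distance(M, B, xi, xj)) > 1:
        return b, False
    d = len(xi)
    diff = [p - q for p, q in zip(xi, xj)]
    for r in range(d):
        for c in range(d):
            M[r][c] -= lr * y * diff[r] * diff[c]     # M_t - gamma * y * Gamma_ij
            B[r][c] += 2.0 * lr * y * xi[r] * xj[c]   # B_t + 2 * gamma * y * xi xj^T
    return b + lr_b * y, True
```

Each update is a subgradient step on the hinge term of Equation (2), so for a violated pair and a small enough learning rate the per-pair loss decreases.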




Figure 2. An illustration of some feature maps of the Conv11, Conv21, and Conv31 layers. At the upper layers, the feature maps capture more
global shape features, which are also more robust to illumination changes than those of Conv11.


Name      Type              Filter Size/Stride   Output Size   Depth   #Params
Conv11    convolution       3x3x1/1              100x100x32    1       0.28K
Conv12    convolution       3x3x32/1             100x100x64    1       18K
Pool1     max pooling       2x2/2                50x50x64      0       -
Conv21    convolution       3x3x64/1             50x50x64      1       36K
Conv22    convolution       3x3x64/1             50x50x128     1       72K
Pool2     max pooling       2x2/2                25x25x128     0       -
Conv31    convolution       3x3x128/1            25x25x96      1       108K
Conv32    convolution       3x3x96/1             25x25x192     1       162K
Pool3     max pooling       2x2/2                13x13x192     0       -
Conv41    convolution       3x3x192/1            13x13x128     1       216K
Conv42    convolution       3x3x128/1            13x13x256     1       288K
Pool4     max pooling       2x2/2                7x7x256       0       -
Conv51    convolution       3x3x256/1            7x7x160       1       360K
Conv52    convolution       3x3x160/1            7x7x320       1       450K
Pool5     avg pooling       7x7/1                1x1x320       0       -
Dropout   dropout (40%)     -                    1x1x320       0       -
Fc6       fully connected   -                    10548         1       3296K
Cost      softmax           -                    10548         0       -
total                                                          11      5006K

Table 1. The architecture of DCNN used in this paper.
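
The #Params column of Table 1 can be reproduced with a short calculation. The sketch below assumes, since the paper does not state it, that the counts cover weights only (no biases or PReLU slopes) and that K denotes 1024; under these assumptions the per-layer values and the total match the table after truncation.

```python
# (name, input channels, output channels) of the ten 3x3 convolutional layers.
CONV_LAYERS = [
    ("Conv11", 1, 32), ("Conv12", 32, 64),
    ("Conv21", 64, 64), ("Conv22", 64, 128),
    ("Conv31", 128, 96), ("Conv32", 96, 192),
    ("Conv41", 192, 128), ("Conv42", 128, 256),
    ("Conv51", 256, 160), ("Conv52", 160, 320),
]
# Fully connected layer: 320-dimensional pool5 feature to 10548 subjects.
FC6 = ("Fc6", 320, 10548)


def weight_counts(kernel=3):
    """Number of weights per layer: kernel * kernel * c_in * c_out for convolutions."""
    counts = {name: kernel * kernel * c_in * c_out
              for name, c_in, c_out in CONV_LAYERS}
    counts[FC6[0]] = FC6[1] * FC6[2]
    return counts


if __name__ == "__main__":
    counts = weight_counts()
    for name, n in counts.items():
        print(f"{name:7s} {n / 1024:9.2f}K")
    print(f"total   {sum(counts.values()) / 1024:9.2f}K")
```

The fully connected layer accounts for roughly two thirds of all weights, which is consistent with applying dropout only before Fc6.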


3.4. DCNN Training Details 

The DCNN is implemented using Caffe [19] and trained on the
CASIA-WebFace dataset. The CASIA-WebFace dataset contains 494,414 face
images of 10,575 subjects downloaded from the IMDB website. After
removing the 27 subjects that overlap with the IJB-A dataset, there
are 10,548 subjects¹ and 490,356 face images. For each subject, there
still exist several false images with wrong identity labels and a few
duplicate images. All images are scaled to [0, 1], and the mean is
subtracted. The data is augmented with horizontally flipped face
images. We use the standard batch size of 128 for the training phase.
Because each batch contains only sparse positive and negative pairs,
in addition to the false image problem, we do not take the
verification cost into consideration as is done in [30]. The initial
negative slope for PReLU is set to 0.25 as suggested in [16]. The
weight decay of all convolutional layers is set to 0, and the weight
decay of the final fully connected layer is set to 5e-4. In addition,
the learning rate is set to 1e-2 initially and reduced by half every
100,000 iterations. The momentum is set to 0.9. Finally, we use the
snapshot of the 1,000,000th iteration for all our experiments.

¹ The list of overlapping subjects is available at
http://www.umiacs.umd.edu/-pullpull/janus_overlap.xlsx
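
The learning-rate schedule described above can be written as a small function. This is a sketch of a step schedule with the stated values; the function and parameter names are illustrative and not taken from the authors' solver configuration.

```python
def learning_rate(iteration, base_lr=1e-2, gamma=0.5, step_size=100_000):
    """Step schedule: start at base_lr and multiply by gamma every step_size iterations."""
    return base_lr * gamma ** (iteration // step_size)


if __name__ == "__main__":
    for it in (0, 100_000, 500_000, 1_000_000):
        print(it, learning_rate(it))
```

Under this schedule the rate has been halved ten times by the 1,000,000th iteration, i.e. it is 1e-2 / 1024.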

4. Experiments 

In this section, we present the results of the proposed approach on
the challenging IARPA Janus Benchmark A (IJB-A) [20], its extended
version, the Janus Challenging set 2 (JANUS CS2) dataset, and the LFW
dataset. The JANUS CS2 dataset contains not only the sampled frames
and images in the IJB-A but also the original videos. The JANUS CS2
dataset² includes much more test data for identification and
verification problems in the defined protocols than the IJB-A dataset.
The receiver operating characteristic (ROC) curves and the cumulative
match characteristic (CMC) scores are used to evaluate the performance
of different algorithms. The ROC curve measures the performance in the
verification scenarios, and the CMC score measures the accuracy in the
closed-set identification scenarios.
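
The two measures can be computed from similarity scores as in the following minimal sketch. It is not the official IJB-A or JANUS CS2 evaluation code; the function names, the tie handling (only scores strictly above the threshold are accepted), and the closed-set assumption that every probe identity appears in the gallery are assumptions of this illustration.

```python
import math


def tar_at_far(genuine, impostor, far):
    """True accept rate at a threshold that accepts at most a fraction `far` of impostor scores."""
    ranked = sorted(impostor, reverse=True)
    allowed = math.floor(far * len(ranked) + 1e-9)  # tolerated false accepts
    threshold = ranked[allowed] if allowed < len(ranked) else float("-inf")
    return sum(s > threshold for s in genuine) / len(genuine)


def cmc(similarity, probe_ids, gallery_ids, max_rank):
    """Cumulative match characteristic for closed-set identification.

    similarity[p][g] is the score between probe p and gallery entry g.
    Returns a list whose element k-1 is the rank-k retrieval accuracy.
    """
    hits = [0] * max_rank
    for row, pid in zip(similarity, probe_ids):
        order = sorted(range(len(row)), key=lambda g: row[g], reverse=True)
        rank = [gallery_ids[g] for g in order].index(pid)  # 0-based rank of the true match
        for k in range(rank, max_rank):
            hits[k] += 1
    return [h / len(probe_ids) for h in hits]
```

Sweeping `far` over a range of values traces out the ROC curve, while the list returned by `cmc` is the CMC curve up to `max_rank`.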

4.1. JANUS-CS2 and IJB-A 

Both the IJB-A and JANUS CS2 contain 500 subjects with 5,397 images
and 2,042 videos split into 20,412 frames, i.e., 11.4 images and 4.2
videos per subject. Sample images and video frames from the datasets
are shown in Fig. 3. The videos are only released for the JANUS CS2
dataset. The IJB-A evaluation protocol consists of verification (1:1
matching) over 10 splits. Each split contains around 11,748 pairs of
templates (1,756 positive and 9,992 negative pairs) on average.
Similarly, the identification (1:N search) protocol also consists of
10 splits, which evaluate the search performance. In each search
split, there are about 112 gallery templates and 1,763 probe templates
(i.e., 1,187 genuine probe templates and 576 impostor probe
templates). On the other hand, for the JANUS CS2, there are about 167
gallery templates and 1,763 probe templates, and all of them are used
for both identification and verification. The training set for both
datasets contains 333 subjects, and the test set contains 167
subjects. Ten random splits of training and testing are provided by
each benchmark. The main differences between the IJB-A and JANUS CS2
evaluation protocols are that (1) IJB-A considers the open-set
identification problem whereas JANUS CS2 considers closed-set
identification, and (2) IJB-A considers the more difficult pairs,
which are subsets of the JANUS CS2 dataset.



Figure 3. Sample images and frames from the IJB-A and JANUS CS2
datasets. A variety of challenging variations in pose, illumination,
resolution, occlusion, and image quality are present in these images.

Both the IJB-A and the JANUS CS2 datasets are divided into training
and test sets. For the test sets of both benchmarks, the images and
video frames of each subject are randomly split into gallery and probe
sets without any overlapping subjects between them. Unlike the LFW and
YTF datasets, which only use a sparse set of negative pairs to
evaluate the verification performance, the IJB-A and JANUS CS2 both
divide the images/video frames into gallery and probe sets so that all
the available positive and negative pairs are used for the evaluation.
Also, each gallery and probe set consists of multiple templates. Each
template contains a combination of images or frames sampled from
multiple image sets or videos of a subject. For example, the size of
the similarity matrix for JANUS CS2 split 1 is 167 x 1806, where 167
is the size of the gallery set and 1806 that of the probe set (i.e.,
the same subject reappears multiple times in different probe
templates). Moreover, some templates contain only one low-quality
profile face with a challenging pose. In contrast to the LFW and YTF
datasets, which only include faces detected by the Viola-Jones face
detector [34], the images in the IJB-A and JANUS CS2 contain extreme
pose, illumination, and expression variations. These factors
essentially make the IJB-A and JANUS CS2 challenging face recognition
datasets [20].

² The JANUS CS2 dataset is not publicly available yet.

4.2. Evaluation on JANUS-CS2 and IJB-A 

For the JANUS CS2 dataset, we compare the results of our DCNN method
with the FV approach proposed in [26] and two other commercial
off-the-shelf matchers, COTS1 and GOTS [20]. The COTS1 and GOTS
baselines provided by JANUS CS2 are the top performers from the most
recent NIST FRVT study [15]. The FV method is trained on the LFW
dataset, which contains few faces with extreme pose. Therefore, we use
the pose information estimated from the landmark detector and select
the face images/video frames whose yaw angles are within ±25 degrees
for each gallery and probe set. If there are no images/frames
satisfying the constraint, we choose the one closest to frontal.
However, for the DCNN method, we use all the frames without applying
the same selection strategy.³ Figures 4 and 5 show the ROC and CMC
curves for the JANUS CS2 and IJB-A datasets, respectively, obtained
using the previously described protocols, where DCNN means using the
DCNN feature with cosine distance, "ft" means finetuning on the
training data, "metric" means applying joint Bayesian metric learning,
and "color" means using all of the RGB images instead of gray-scale
images. For the results of DCNN_ft+metric, besides finetuning and
metric learning, we also replace ReLU with PReLU and apply data
augmentation (i.e., randomly cropping 100 x 100-pixel subregions from
a 125 x 125 region). For DCNN_ft+metric+color⁴, we further use RGB
images and larger face regions (i.e., we use 125 x 125-pixel face
regions and resize them to 100 x 100 pixels). Then, we show the fusion
results, DCNN_fusion, obtained by directly summing the similarity
scores of two models, DCNN_ft+metric and DCNN_ft+metric+color, where
DCNN_ft+metric is trained on gray-scale images with smaller face
regions and DCNN_ft+metric+color is trained on RGB images with larger
face regions. From these figures, we can clearly see the contribution
of each component to the improvement of the final identification and
verification results. From the ROC and CMC curves, we see that the
DCNN method performs better than the other competitive methods. This
can be attributed to the fact that the DCNN model captures face
variations over a large dataset and generalizes well to a new small
dataset.

³ We correct a typo in [8]: the selection strategy is applied only to
the FV-based method, not to the DCNN.

⁴ DCNN_ft+metric+color and DCNN_fusion are our improved results for
the JANUS CS2 and IJB-A datasets obtained after the paper was
accepted.

Table 2. Query results. The first column shows the query images from probe templates. The remaining 5 columns show the corresponding
top-5 queried gallery templates.

We illustrate the query samples in Table 2. The first column shows
the query images from the probe templates. The remaining five columns
show the corresponding top-5 queried gallery templates (i.e., rank-1
means the most similar one, rank-2 the second most similar, etc.). For
the first two rows, our approach successfully finds the subjects at
rank 1. For the third row, the query template contains only one image
with extreme pose, whereas the corresponding gallery template for the
same subject happens to contain only near-frontal faces. Thus, our
approach fails to find the subject within the top-5 matches. To solve
the pose generalization problem of CNN features, one possible solution
is to augment the templates by synthesizing faces in various poses
with the help of a generic 3D model. We leave this approach for future
work.

While this paper was under preparation, the authors became aware of
[35], which also proposes a CNN-based approach for face
verification/identification and evaluates it on the IJB-A dataset. The
method proposed in [35] combines the features from seven independent
DCNN models. With finetuning on the JANUS training data and metric
learning, our approach performs comparably to [35], as shown in
Figure 5. Furthermore, with the replacement of ReLU with PReLU and
data augmentation, our approach significantly outperforms [35] with
only a single model.

4.3. Labeled Faces in the Wild

We also evaluate our approach on the well-known LFW dataset using the
standard protocol, which defines 3,000 positive pairs and 3,000
negative pairs in total and further splits them into 10 disjoint
subsets for cross validation. Each subset contains 300 positive and
300 negative pairs. The pairs contain 7,701 images of 4,281 subjects.
We compare the mean accuracy of the proposed deep model with other
state-of-the-

⁵ We correct the number reported in [8] previously for the IJB-A
identification task because one split of the identification task was
performed partially due to corrupted metadata (i.e., some images were
missing at that time; the current metadata of IJB-A has already fixed
those errors).

Figure 4. Results on the JANUS CS2 dataset: (a) the average ROC curves and (b) the average CMC curves.

Figure 5. Results on the IJB-A dataset: (a) the average ROC curves for the IJB-A verification protocol and (b) the average CMC curves for
the IJB-A identification protocol over 10 splits.


IJB-A-Verif   [35]          DCNN          DCNN_ft       DCNN_ft+m     DCNN_ft+m+c   DCNN_fusion
FAR=1e-2      0.732±0.033   0.573±0.024   0.64±0.045    0.787±0.043   0.818±0.037   0.838±0.042
FAR=1e-1      0.895±0.013   0.8±0.012     0.883±0.012   0.947±0.011   0.961±0.01    0.967±0.009

IJB-A-Ident   [35]          DCNN          DCNN_ft       DCNN_ft+m     DCNN_ft+m+c   DCNN_fusion
Rank-1        0.820±0.024   0.726±0.034   0.799±0.036   0.852±0.018   0.882±0.01    0.903±0.012
Rank-5        0.929±0.013   0.84±0.023    0.901±0.025   0.937±0.01    0.957±0.07    0.965±0.008
Rank-10       N/A           0.884±0.025   0.934±0.016   0.954±0.007   0.974±0.005   0.977±0.007

Table 3. Results on the IJB-A dataset: the TAR of all the approaches at FAR=0.1 and 0.01 for the ROC curves, and the Rank-1, Rank-5, and
Rank-10 retrieval accuracies of the CMC curves. The subscripts ft, m, and c stand for finetuning, metric, and color, respectively.
0.902±0.011

0.876±0.013 

0.973±0.005 

0.904±0.011 

0.983±0.004 

0.921±0.013 

0.985±0.004 

CS2-Ident   COTS1         GOTS          FV [26]       DCNN          DCNN_ft       DCNN_ft+m     DCNN_ft+m+c   DCNN_fusion
Rank-1      0.551±0.03    0.413±0.022   0.381±0.018   0.694±0.012   0.768±0.013   0.838±0.012   0.867±0.01    0.891±0.01
Rank-5      0.694±0.017   0.571±0.017   0.559±0.021   0.809±0.011   0.874±0.01    0.924±0.009   0.949±0.005   0.957±0.007
Rank-10     0.741±0.017   0.624±0.018   0.637±0.025   0.85±0.009    0.91±0.008    0.949±0.006   0.966±0.005   0.972±0.005

Table 4. Results on the JANUS CS2 dataset: the TAR of all the approaches at FAR=0.1 and 0.01 for the ROC curves, and the Rank-1, Rank-5,
and Rank-10 retrieval accuracies of the CMC curves. The subscripts ft, m, and c stand for finetuning, metric, and color, respectively.


art deep learning-based methods: DeepFace [33], DeepID2 [?], DeepID3
[29], FaceNet [25], Yi et al. [38], Wang et al. [35], and human
performance on the "funneled" LFW images. The results are summarized
in Table 5. It can be seen from this table that our approach performs
comparably to other deep learning-based methods. Note that some of the
deep learning-based methods compared in Table 5 use millions of data
samples for training the model, whereas we use only the CASIA dataset,
which has fewer than 500K images.
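
The 10-fold evaluation can be sketched as follows for a similarity score with a single decision threshold. This is an illustration and not the official LFW tooling: the function names are assumptions, the threshold is chosen as the training-fold score that maximizes training accuracy, and the spread is reported as the sample standard deviation across folds, which may differ from the exact statistic behind the numbers in Table 5.

```python
import statistics


def accuracy(scores, labels, threshold):
    """Fraction of pairs classified correctly when score >= threshold means 'same'."""
    return sum((s >= threshold) == bool(l) for s, l in zip(scores, labels)) / len(scores)


def best_threshold(scores, labels):
    """Pick the observed score that maximizes accuracy on the given pairs."""
    return max(sorted(set(scores)), key=lambda t: accuracy(scores, labels, t))


def cross_validated_accuracy(folds):
    """folds is a list of (scores, labels) tuples, one per disjoint subset.

    For each fold, the threshold is selected on the remaining folds and the
    accuracy is measured on the held-out fold. Returns (mean, std).
    """
    accs = []
    for k, (test_scores, test_labels) in enumerate(folds):
        train_scores = [s for j, f in enumerate(folds) if j != k for s in f[0]]
        train_labels = [l for j, f in enumerate(folds) if j != k for l in f[1]]
        t = best_threshold(train_scores, train_labels)
        accs.append(accuracy(test_scores, test_labels, t))
    return statistics.mean(accs), statistics.stdev(accs)
```

With the standard protocol, `folds` would hold ten entries of 600 pairs each (300 positive and 300 negative).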

Method                 #Net   Training Set                                        Metric                      Mean Accuracy ± Std
DeepFace [33]          1      4.4 million images of 4,030 subjects, private       cosine                      95.92% ± 0.29%
DeepFace               7      4.4 million images of 4,030 subjects, private       unrestricted, SVM           97.35% ± 0.25%
DeepID2 [?]            1      202,595 images of 10,117 subjects, private          unrestricted, Joint-Bayes   95.43%
DeepID2                25     202,595 images of 10,117 subjects, private          unrestricted, Joint-Bayes   99.15% ± 0.15%
DeepID3 [29]           50     202,595 images of 10,117 subjects, private          unrestricted, Joint-Bayes   99.53% ± 0.10%
FaceNet [25]           1      260 million images of 8 million subjects, private   L2                          99.63% ± 0.09%
Yi et al. [38]         1      494,414 images of 10,575 subjects, public           cosine                      96.13% ± 0.30%
Yi et al.              1      494,414 images of 10,575 subjects, public           unrestricted, Joint-Bayes   97.73% ± 0.31%
Wang et al. [35]       1      494,414 images of 10,575 subjects, public           cosine                      96.95% ± 1.02%
Wang et al.            7      494,414 images of 10,575 subjects, public           cosine                      97.52% ± 0.76%
Wang et al.            1      494,414 images of 10,575 subjects, public           unrestricted, Joint-Bayes   97.45% ± 0.99%
Wang et al.            7      494,414 images of 10,575 subjects, public           unrestricted, Joint-Bayes   98.23% ± 0.68%
Human, funneled [35]   N/A    N/A                                                 N/A                         99.20%
Ours                   1      490,356 images of 10,548 subjects, public           cosine                      97.15% ± 0.7%
Ours                   1      490,356 images of 10,548 subjects, public           unrestricted, Joint-Bayes   97.45% ± 0.7%

Table 5. Accuracy of different methods on the LFW dataset.


4.4. Run Time 

The DCNN model is trained for about 9 days using an NVidia Tesla K40.
Feature extraction takes about 0.006 seconds per face image. In the
future, we plan to feed the supervised information into the
intermediate layers to make the model more discriminative and to make
it converge faster.

5. Conclusion 

In this paper, we study the performance of a DCNN method on a newly
released challenging face verification dataset, IARPA Janus Benchmark
A (IJB-A), which contains faces with full pose, illumination, and
other difficult conditions. We show that the DCNN approach can learn a
robust model from a large dataset characterized by face variations and
that it generalizes well to another dataset. Experimental results
demonstrate that the performance of the proposed DCNN on the IJB-A
dataset is much better than that of the FV-based method and other
commercial off-the-shelf matchers, and that it is competitive on the
LFW dataset.

For future work, we plan to directly train a Siamese network using
all the available positive and negative pairs from the CASIA-WebFace
and IJB-A training datasets to fully utilize the discriminative
information and achieve better performance.